These are unofficial data splits for the MADCAT Chinese Pilot Training Set corpus (LDC2014T13).
LDC provides only the training data for this corpus, not the original dev/eval sets, so the original
training data have been split into three disjoint parts to allow evaluation in the usual way.
The split is made at the document level: sentences/lines from the same document should not appear
in different sets, since each document in the MADCAT data is handwritten/transcribed by a different author.
<p/>
Also, please note that the license applies only to the splits. You still need to obtain the original databases
and respect their license!
<p/>
Each line of a split file contains the MADCAT XML file name and a segment id (s{1,2,3,4}). For example:
<pre>
	GMW_CMN_20070118.0014_001_LDC0632.madcat.xml s1
	GMW_CMN_20070118.0014_001_LDC0632.madcat.xml s2
	GMW_CMN_20070118.0014_001_LDC0632.madcat.xml s3
</pre>
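<p/>
The line format above can be read with a few lines of Python. This is only a minimal sketch, not part of the official tooling: the function names parse_split and shared_documents are hypothetical, and it assumes that every line consists of exactly two whitespace-separated fields (XML file name, segment id).

```python
from collections import defaultdict


def parse_split(lines):
    """Map each MADCAT XML file name to the segment ids listed for it.

    Assumes each line holds exactly two whitespace-separated fields:
    the XML file name and the segment id. Other lines are skipped.
    """
    segments = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # blank or malformed line
        xml_name, segment_id = fields
        segments[xml_name].append(segment_id)
    return dict(segments)


def shared_documents(split_a, split_b):
    """Return the XML file names that occur in both splits.

    For disjoint splits this set should be empty.
    """
    return set(split_a) & set(split_b)


# The three example lines shown above (leading tab included).
example_lines = [
    "\tGMW_CMN_20070118.0014_001_LDC0632.madcat.xml s1",
    "\tGMW_CMN_20070118.0014_001_LDC0632.madcat.xml s2",
    "\tGMW_CMN_20070118.0014_001_LDC0632.madcat.xml s3",
]
example_split = parse_split(example_lines)
```

To check real split files, pass an open file object to parse_split for each split and confirm that shared_documents returns an empty set for every pair.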

